Recursive Array Layouts and Fast Matrix Multiplication
Authors
Abstract
The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2 to 2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder computation to conserve memory space and improve performance between 10% and 20%. Carrying the recursive layout down to the level of individual matrix elements is shown to be counter-productive; a combination of recursive layouts down to canonically ordered matrix tiles instead yields higher performance. Five recursive layouts with successively increasing complexity of address computation are evaluated, and it is shown that addressing overheads can be kept in control even for the most computationally demanding of these layouts.
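To make the idea of "recursive layouts down to canonically ordered matrix tiles" concrete, here is a minimal sketch of the address computation for one such layout. It is an illustration under assumptions, not the paper's implementation: the abstract does not name its five layouts, so the widely known Z-Morton (bit-interleaved) ordering is assumed for the tiles, row-major order is assumed inside each tile, and the function names `interleave_bits` and `tiled_morton_offset` are hypothetical. The matrix is assumed to be square, with a side equal to the tile size times a power of two.

```python
def interleave_bits(row: int, col: int) -> int:
    """Z-Morton index of (row, col): column bits go to even bit positions,
    row bits to odd bit positions."""
    z = 0
    bit = 0
    while (row >> bit) or (col >> bit):
        z |= ((col >> bit) & 1) << (2 * bit)
        z |= ((row >> bit) & 1) << (2 * bit + 1)
        bit += 1
    return z


def tiled_morton_offset(i: int, j: int, tile: int) -> int:
    """Linear offset of element (i, j) when tiles are ordered by Z-Morton
    index and elements inside each tile are stored row-major (assumed)."""
    tile_index = interleave_bits(i // tile, j // tile)
    return tile_index * tile * tile + (i % tile) * tile + (j % tile)


if __name__ == "__main__":
    # Print the offset of every element of a 4x4 matrix with 2x2 tiles.
    for i in range(4):
        print([tiled_morton_offset(i, j, 2) for j in range(4)])
```

Because the recursion stops at the tile level, the bit interleaving is paid once per tile rather than once per element, and the inner loops of a multiplication kernel can walk each tile with ordinary row-major strides.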
Similar resources
Using Non-canonical Array Layouts in Dense Matrix Operations
We present two implementations of dense matrix multiplication based on two different non-canonical array layouts: one based on a hypermatrix data structure (HM) where data submatrices are stored using a recursive layout; the other based on a simple block data layout with square blocks (SB) where blocks are arranged in column-major order. We show that the iterative code using SB outperforms a re...
Low Complexity and High speed in Leading DCD ERLS Algorithm
Adaptive algorithms adjust the system coefficients based on the measured data. This paper presents a dichotomous coordinate descent method to reduce the computational complexity and to improve the tracking ability based on the variable forgetting factor when there are a lot of changes in the system. Vedic mathematics is used to implement the multiplier and the divider in the VFF equatio...
Recursion removal in fast matrix multiplication
Recursion removal improves the efficiency of recursive algorithms, especially algorithms with large formal parameters, such as fast matrix multiplication algorithms. In this article, a general method for removing recursion from fast matrix multiplication algorithms is introduced, generalized from the recursion removal of a specific fast matrix multiplication algorithm of Winograd.
Generic support of algorithmic and structural recursion for scientific computing
Recursive algorithms, like quick-sort, and recursive data structures, like trees, play a central role in programming. In the context of scientific computing, recursive algorithms and memory layouts are studied to provide excellent cache and TLB locality independently of the platform. We show how, for the first time, generic programming (GP) and OO allow us to abstract a multitude of dense-matri...
Fast recursive matrix multiplication for multi-core architectures
In this article, we present a fast algorithm for matrix multiplication optimized for recent multicore architectures. The implementation exploits different methodologies from parallel programming, like recursive decomposition, efficient low-level implementations of basic blocks, software prefetching, and task scheduling resulting in a multilevel algorithm with adaptive features. Measurements on ...
Journal:
- IEEE Trans. Parallel Distrib. Syst.
Volume 13, Issue
Pages -
Publication date 2002